Abstract

We address the problem of efficiently gathering correlated data from a wired or a wireless sensor 
network, with the aim of designing algorithms with provable optimality guarantees, and understanding 
how close we can get to the known theoretical lower bounds. Our proposed approach is based on finding 
an optimal or a near-optimal compression tree for a given sensor network: a compression tree is a 
directed tree over the sensor network nodes such that the value of a node is compressed using the value 
of its parent. We consider this problem under different communication models, including the broadcast 
communication model that enables many new opportunities for energy-efficient data collection. We draw 
connections between the data collection problem and a previously studied graph concept, called weakly 
connected dominating sets, and we use this to develop novel approximation algorithms for the problem. 
We present comparative results on several synthetic and real-world datasets showing that our algorithms 
construct near-optimal compression trees that yield a significant reduction in the data collection cost. 

1 Introduction 

In this paper, we address the problem of designing energy-efficient protocols for collecting all data observed 
by the sensor nodes in a sensor network at an Internet-connected base station, at a specified frequency. 
The key challenges in designing an energy-efficient data collection protocol are effectively exploiting the 
strong spatio-temporal correlations present in most sensor networks, and optimizing the routing plan for 
data movement. In most sensor network deployments, especially in environmental monitoring applications, 
the data generated by the sensor nodes is highly correlated both in time (future values are correlated with 
current values) and in space (two co-located sensors are strongly correlated). These correlations can usually 
be captured by constructing predictive models using either prior domain knowledge, or historical data traces. 
However, the distributed nature of data generation and the resource-constrained nature of the sensor devices, 
make it a challenge to optimally exploit these correlations. 

Consider an $n$-node sensor network, with node $i$ monitoring the value of a variable $X_i$, and generating
a data flow at an entropy rate of $H(X_i)$. In the naive protocol, data from each source is simply sent to the
base station through the shortest path, rendering a total data transmission cost of $\sum_i H(X_i) \cdot d(i, BS)$,
where $d(i, BS)$ is the length of a shortest path from node $i$ to the base station. However, because of the
strong spatial correlations among the $X_i$, the joint entropy of the nodes, $H(X_1, \ldots, X_n)$, is typically
much smaller than the sum of the individual entropies; the naive protocol ignores these correlations.
A lower bound on the total number of bits that need to be communicated can be computed using the
Distributed Source Coding (DSC) theorem [25] [30] [31] [26]. In their seminal work, Slepian and Wolf [25]
prove that it is theoretically possible to encode the correlated information generated by distributed data
sources (in our case, the sensor nodes) at the rate of their joint entropy even if the data sources do not
communicate with each other. This can be translated into the following lower bound on the total amount of
data transmitted for a multi-hop network: $\sum_{i=1}^{n} d(i, BS) \times H(X_i \mid X_1, \ldots, X_{i-1})$, where
$X_1, \ldots, X_n$ are sorted in increasing order by their distances to the base station [7] [26]. With high
spatial correlation, this number is expected to be much smaller than the total cost for the naive protocol
(i.e., $H(X_i \mid X_1, \ldots, X_{i-1}) \ll H(X_i)$). The DSC result unfortunately is non-constructive, with
constructive techniques known for only a few specific distributions [22]; more importantly, DSC requires
perfect knowledge of the correlations among the nodes, and may return wrong answers if the observed data
values deviate from what is expected.

However, the lower bound does suggest that significant savings in total cost are possible by exploiting 
the correlations. Pattem et al. [21], Chu et al. [6], and Cristescu et al. [8], among others, propose practical data
collection protocols that exploit the spatio-temporal correlations while guaranteeing correctness (through 
explicit communication among the sensor nodes). These protocols may exploit only a subset of the corre- 
lations, and in many cases, assume uniform entropies and conditional entropies. Further, most of this prior 
work has not attempted to provide any approximation guarantees on the solutions, nor have they attempted a 
rigorous analysis of how the performance of the proposed data collection protocol compares with the lower 
bound suggested by DSC. 

We are interested in understanding how to get as close to the DSC lower bound as possible for a given 
sensor network and a given set of correlations among the sensor nodes. In a recent work, Liu et al. [19]
considered a similar problem to ours and developed an algorithm that performs very well compared to the DSC
lower bound. However, their results are implicitly based on the assumption that the conditional entropies
are quite substantial compared to the base variable entropies (specifically, that
$H(X_i \mid X_1, \ldots, X_{i-1})$ is lower bounded). Our results here are complementary in that we specifically
target the case when the conditional
entropies are close to zero (i.e., the correlations are strong), and we are able to obtain approximation algo- 
rithms for that case. We note that we are also able to prove that obtaining better approximation guarantees 
is NP-hard, so our results are tight for that case. As we will see later, lower bounding conditional entropies 
enables us to get better approximation results and further exploration of this remains a rich area of future 
work. 

In this paper, we analyze the data collection problem under the restriction that any data collection proto- 
col can directly utilize only second-order marginal or conditional probability distributions - in other words, 
we only directly utilize pair-wise correlations between the sensor nodes. There are several reasons for study- 
ing this problem. First off, the entropy function typically obeys a strong diminishing returns property in that, 
utilizing higher-order distributions may not yield significant benefits over using only second-order distribu- 
tions. Second, learning, and utilizing, second-order distributions is much easier than learning higher-order 
distributions (which can typically require very high volumes of training data). Finally, we can theoretically 
analyze the problem of finding the optimal data collection scheme under this restriction, and we are able to 
develop polynomial-time approximation algorithms for solving it. 

This restriction leads to what we call compression trees. Generally speaking, a compression tree is 
simply a directed spanning tree $T$ of the communication network in which the parents are used to compress
the values of the children. More specifically, given a directed edge $(u, v)$ in $T$, the value of $X_v$ is
compressed using the value of $X_u$¹ (i.e., we use the value of $X_u = x_u$ to compute the conditional
distribution $p(X_v \mid X_u = x_u)$ and use this distribution to compress the observed value of $X_v$ (using,
say, Huffman coding)). The compression tree also specifies a data movement scheme, specifying where (at
which sensor node) and how the values of $X_u$ and $X_v$ are collected for compression.

1 In the rest of the paper, we denote this by $X_v \mid X_u$.

Figure 1: Illustrating different data collection approaches on the 5-node running example.
(i) IND: Cost $= H(X_1) + H(X_2) + 2 H(X_3) + 3 H(X_4) + 2 H(X_5) = 9$.
(ii) Cluster: Cost $= H(X_1) + H(X_2 X_5) + 2 H(X_3 X_4) + H(X_4) + H(X_5) = 1 + (1 + \epsilon) + 2(1 + \epsilon) + 1 + 1 = 6 + 3\epsilon$.
(iii) DSC: Cost $= H(X_1) + H(X_2 \mid X_1) + 2 H(X_3 \mid X_1 X_2 X_5) + 3 H(X_4 \mid X_1 X_2 X_3 X_5) + 2 H(X_5 \mid X_1 X_2) = 1 + \epsilon + 2\epsilon + 3\epsilon + 2\epsilon = 1 + 8\epsilon$.
(iv) Compression tree with edges 1 → 2, 1 → 3, 1 → 5 and 3 → 4: Cost $= H(X_1) + H(X_2 \mid X_1) + 2 H(X_3 \mid X_1) + H(X_4) + 2 H(X_4 \mid X_3) + 2 H(X_5 \mid X_1) = 1 + \epsilon + 2\epsilon + 1 + 2\epsilon + 2\epsilon = 2 + 7\epsilon$ (the cost under the WN model would have been $5 + 7\epsilon$).

The compression tree-based approach can be seen as a special case of the approach presented by one of 
the authors in prior work [28]. There, the authors proposed using decomposable models for data collection
in wireless sensor networks, of which compression trees can be seen as a special case. However, that work 
only presented heuristics for solving the problem, and did not present any rigorous analysis or approximation 
guarantees. 



2 Problem Definition 

We begin by presenting preliminary background on data compression in sensor networks, discuss the prior 
approaches, and then introduce the compression tree-based approach. 

2.1 Notation and Preliminaries 

We are given a sensor network modeled as an undirected, edge-weighted graph $G_C(V = \{1, \ldots, n\}, E)$,
comprising $n$ nodes that are continuously monitoring a set of distributed attributes
$\mathcal{X} = \{X_1, \ldots, X_n\}$. The edge set $E$ consists of pairs of vertices that are within communication
radius of each other, with the edge weights denoting the communication costs. Each attribute $X_i$, observed
by node $i$, may be an environmental property being sensed by the node (e.g., temperature), or it may be the
result of an operation on the sensed values (e.g., in an anomaly-detection application, the sensor node may
continuously evaluate a filter such as "temp > 100" on the observed values). If the sensed attributes are
continuous, we assume that an error threshold of $\epsilon$ is provided and the readings are binned into
intervals of size $2\epsilon$ to discretize them. In this paper, we focus on optimal exploitation of spatial
correlations at any given time $t$; our approach can be generalized to handle temporal correlations in a
straightforward manner.
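
The discretization step described above can be sketched as follows. This is only an illustration under assumed names (`discretize`, `reconstruct`) and assumed sample readings, none of which come from the paper; it shows that replacing a reading by the midpoint of its bin of width $2\epsilon$ keeps the error within $\epsilon$.

```python
import math

def discretize(reading: float, eps: float) -> int:
    """Map a continuous reading to the index of its bin of width 2*eps."""
    return math.floor(reading / (2 * eps))

def reconstruct(bin_index: int, eps: float) -> float:
    """Return the bin midpoint, which lies within eps of every reading in the bin."""
    return (2 * bin_index + 1) * eps

# Hypothetical temperature readings and error threshold (assumptions for illustration).
eps = 0.5
readings = [20.3, 20.9, 21.4, 19.7]
bins = [discretize(r, eps) for r in readings]
approx = [reconstruct(b, eps) for b in bins]
max_error = max(abs(r - a) for r, a in zip(readings, approx))
```

Only the bin indices need to be encoded and transmitted; the base station recovers each value to within the error threshold from the index alone.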

We are also provided with the entropy rate for each attribute, $H(X_i)$ ($1 \le i \le n$), and the conditional
entropy rates, $H(X_i \mid X_j)$ ($1 \le i, j \le n$), over all pairs of attributes. More generally, we may be
provided with a joint probability distribution, $p(X_1, \ldots, X_n)$, over the attributes, using which we can
compute the joint entropy rate for any subset of attributes. However, accurate computation of such joint
entropies for large subsets of attributes is usually not feasible.
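
As a minimal sketch of how the pairwise quantities above might be obtained, the following computes $H(X_i)$ and $H(X_i \mid X_j)$ from a two-variable joint distribution via $H(X_i \mid X_j) = H(X_i, X_j) - H(X_j)$. The helper names and the joint table are assumptions chosen for illustration, not data from the paper.

```python
import math
from collections import defaultdict

def entropy(pmf):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    """Marginalize a joint pmf keyed by (x_i, x_j) onto one coordinate."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

def conditional_entropy(joint):
    """H(X_i | X_j) for a joint pmf keyed by (x_i, x_j)."""
    return entropy(joint) - entropy(marginal(joint, 1))

# Hypothetical joint distribution of two strongly correlated binary attributes.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
h_i = entropy(marginal(joint, 0))
h_i_given_j = conditional_entropy(joint)
```

For this table the marginal entropy is one bit, while the conditional entropy is under half a bit, which is the kind of gap the compression-tree approach exploits.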

We denote the set of neighbors of node $i$ by $N(i)$, and let $\bar{N}(i) = N(i) \cup \{i\}$ and
$\deg(i) = |N(i)|$. We denote by $d(i, j)$ the energy cost of communicating one bit of information along the
shortest path between $i$ and $j$.
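
The per-bit costs $d(i, BS)$ can be computed with a standard single-source shortest-path routine. The sketch below uses Dijkstra's algorithm on a hypothetical unit-weight topology (node 0 plays the role of the base station); the edge list is an assumption chosen so that the hop distances match those used in the running example.

```python
import heapq

def shortest_path_costs(edges, source):
    """Dijkstra's algorithm on an undirected weighted graph given as (u, v, w) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical topology: node 0 is the base station, all links have unit cost.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (1, 3, 1.0), (1, 5, 1.0), (3, 4, 1.0)]
dist_to_bs = shortest_path_costs(edges, 0)
```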

We consider the following communication cost models in this paper. The data movement schemes and 
how the costs are counted differ among different models. 

Wireless Network (WL): In this model, when a node transmits a message, all its neighbors can hear the 
message (broadcast model). We further assume that the energy cost of receiving such a broadcast mes- 
sage is negligible, and we only count the cost of transmitting the message. If the unicast protocol is used, 
the network behaves as a wired network (see below). 

Wired Network (WN): Here we assume point-to-point communication without any broadcast functionality.
Each communication link is weighted, denoting the cost of transmitting one bit of message through this 
link. 

Multicast: When a sender needs to communicate a piece of information to multiple receivers, we allow 
for sharing of transmissions. Namely, a message can be sent from the source to a set of terminals 
through a Steiner tree. 

Unicast: Each communication is between two nodes (one sender and one receiver). Different message
transmissions cannot be shared and the cost of each communication is counted separately. 

2.2 Prior Approaches 

Given the entropy and the joint entropy rates for compressing the sensor network attributes, the key issue 
with using them for data compression is that the values are generated in a distributed fashion. The naive 
approach to using all the correlations in the data is to gather the sensed values at a central sensor node, and 
compress them jointly. However, even if the compression itself was feasible, the data gathering cost would 
typically dwarf any advantages gained by doing joint compression. Prior research in this area has suggested 
several approaches that utilize a subset of correlations instead. Several of these approaches are illustrated in 
Figure 1 using a simple 5-node sensor network.

IND: Each node compresses its own value, and sends it to the base station along the shortest path. The total
communication cost is given by $\sum_i d(i, BS) \cdot H(X_i)$.

Cluster: In this approach [21] [6], the sensor nodes are grouped into clusters, and the data from the nodes
in each cluster is gathered at a node (which may be different for different clusters) and is compressed
jointly. Figure 1(ii) shows an example of this using three clusters {1}, {2, 5}, {3, 4}. Thus the
intra-cluster spatial correlations are exploited during compression; however, the correlations across
clusters are not utilized.

Cristescu et al. [8]: The approach proposed by Cristescu et al. is similar to ours, and also only uses
second-order distributions. They present algorithms for the WN case, further assuming that the entropies and
conditional entropies are uniform. The solution space that we consider in this paper is larger than the one
they consider, in that it allows more freedom in choosing the compression trees²; in spite of that, we are
able to develop a PTIME algorithm for the problem they address (see Section 4). Further, we make no
uniformity assumptions about the entropies or the conditional entropies in that algorithm.

2 However, they require all the communication to be along a tree; we don't require that from our solutions.
DSC: Distributed source coding (DSC), although not feasible in this setting for the reasons discussed
earlier, can be used to obtain a lower bound on total communication cost as follows [7] [26]. Let
the sensor nodes be numbered in increasing order by distances from the base station (i.e., for all $i$,
$d(i, BS) \le d(i + 1, BS)$). The optimal scheme for using DSC is as follows: $X_1$ is compressed by
itself, and transmitted directly to the sink (incurring a total cost of $d(1, BS) \times H(X_1)$). Then, $X_2$
is compressed according to the conditional distribution of $X_2$ given the value of $X_1$, resulting in a data
flow rate of $H(X_2 \mid X_1)$ (since the sink already has the value of $X_1$, it is able to decode according to
this distribution). Note that, according to the distributed source coding theorem [25], sensor node 2 does not
need to know the actual value of $X_1$. Similarly, $X_i$ is compressed according to its conditional
distribution given the values of $X_1, \ldots, X_{i-1}$. The total communication cost incurred by this scheme is
given by:

$\sum_{i=1}^{n} d(i, BS) \times H(X_i \mid X_1, \ldots, X_{i-1})$

Figure 1(iii) shows this for our running example (note that node 5 is closer to the sink than nodes 3 or 4).

RDC: Several approaches where data is compressed along the way to the base station (routing driven
compression [21] [23] [12]) have also been suggested. These, however, require joint compression and
decompression of large numbers of data sources inside the network, and hence may not be suitable for
resource-constrained sensor networks.

Dominating Set-based: Kotidis [18] and Gupta et al. [14], among others, consider approaches based on using
a representative set of sensor nodes to approximate the data distribution over the entire network; these
approaches however do not solve the problem of exact data collection, and cannot provide correctness 
guarantees. 

As we can see in Figure 1, if the spatial correlation is high, both IND and Cluster incur much higher
communication costs than DSC. For example, if $H(X_i) = 1,\ \forall i$, and if
$H(X_i \mid X_j) = \epsilon \approx 0,\ \forall i, j$ (i.e., if the spatial correlations are almost perfect), the
total communication costs of IND, Cluster (as shown in the figure), and DSC would be approximately 9, 6,
and 1 respectively.
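
The three costs quoted above can be reproduced with a short calculation. The sketch below assumes the uniform setting ($H(X_i) = 1$, every conditional entropy equal to $\epsilon$), hop distances of 1, 1, 2, 3, 2 for nodes 1 to 5 as in the running example, and that each cluster member is one hop from its cluster head; the function names and the value of `EPS` are illustrative assumptions.

```python
EPS = 0.01  # assumed conditional entropy under near-perfect correlation
dist_to_bs = {1: 1, 2: 1, 3: 2, 4: 3, 5: 2}  # hop distances in the running example

def ind_cost(dist, h=1.0):
    """IND: every node ships its own value, compressed independently."""
    return sum(d * h for d in dist.values())

def cluster_cost(clusters, dist, eps, h=1.0):
    """Cluster: members send raw values one hop to the head (assumption),
    and the head forwards the jointly compressed cluster to the base station."""
    total = 0.0
    for head, members in clusters.items():
        others = [m for m in members if m != head]
        total += len(others) * h                        # gathering inside the cluster
        total += dist[head] * (h + len(others) * eps)   # joint entropy of the cluster
    return total

def dsc_bound(dist, eps, h=1.0):
    """DSC: the closest node sends h bits, every later node sends eps bits."""
    order = sorted(dist, key=dist.get)
    return sum(dist[i] * (h if k == 0 else eps) for k, i in enumerate(order))

ind = ind_cost(dist_to_bs)
cluster = cluster_cost({1: [1], 2: [2, 5], 3: [3, 4]}, dist_to_bs, EPS)
dsc = dsc_bound(dist_to_bs, EPS)
```

With these assumptions the values are $9$, $6 + 3\epsilon$ and $1 + 8\epsilon$, matching the costs listed for Figure 1.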

2.3 Compression Trees 

As discussed in the introduction, in practice, we are likely to be limited to using only low-order marginal or 
conditional probability distributions for compression in sensor networks. In this paper, we begin a formal 
analysis of such algorithms by analyzing the problem of optimally exploiting the spatial correlations under 
the restriction that we can only use second-order conditional distributions (i.e., two-variable probability 
distributions). A feasible solution under this restriction is fully specified by a directed spanning tree T 
rooted at r (called a compression tree) and a data movement scheme according to T. In particular, the 
compression tree indicates which of the second-order distributions are to be used, and the data movement 
scheme specifies an actual plan to implement it. 

More formally, let $p(i)$ denote the parent of $i$ in $T$. This indicates that both $X_i$ and $X_{p(i)}$ should
be gathered together at some common sensor node, and that $X_i$ should be compressed using its conditional
probability distribution given the value of $X_{p(i)}$ (i.e., $p(X_i \mid X_{p(i)} = x_{p(i)})$). The compressed
value is communicated to the base station along the shortest path, resulting in an entropy rate of
$H(X_i \mid X_{p(i)})$. Finally, the root of the tree, $r$, sends its own value directly to the base station,
resulting in an entropy rate of $H(X_r)$. It is easy to see that the base station can reconstruct all the
values. The data movement plan specifies how the values of $X_i$ and $X_{p(i)}$ are collected together for
all $i$.

In this paper, we address the optimization problem of finding the optimal compression tree that mini- 
mizes the total communication cost, for a given communication topology and a given probability distribution 
over the sensor network variables (or the entropy rates for all variables, and the joint entropy rates for all
pairs of variables). 

We note that the notion of compression trees is quite similar to the so-called Chow-Liu trees Q, used 
for approximating large joint probability distributions. 

Example 1: Figure 1(iv) shows the process of collecting data using a compression tree for our running
example, under the broadcast communication model. The compression tree (not explicitly shown) consists
of four edges: 1 → 2, 1 → 3, 1 → 5 and 3 → 4. The data collection steps are:

1. Sensor nodes 1 and 4 broadcast their values, using $H(X_1)$ and $H(X_4)$ bits respectively. The base
station receives the value of $X_1$ in this step.

2. Sensor nodes 2, 3, and 5 receive the value of $X_1$, and compress their own values using the conditional
distributions given $X_1$. Each of them sends the compressed value to the base station along the shortest
path.

3. Sensor node 3 also receives the value of $X_4$, and it compresses $X_4$ using its own value. It sends the
compressed value (at an entropy rate of $H(X_4 \mid X_3)$) to the base station along the shortest path.

The total (expected) communication cost is thus given by:

$H(X_1) + H(X_4) + H(X_2 \mid X_1) + 2 \times H(X_3 \mid X_1) + 2 \times H(X_5 \mid X_1) + 2 \times H(X_4 \mid X_3)$

If the conditional entropies are very low, as is usually the case, the total cost will be approximately
$H(X_1) + H(X_4)$.
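
The cost of Example 1 can be tallied step by step. The sketch below assumes the uniform setting ($H(X_i) = 1$, conditional entropies $\epsilon$), that a local broadcast costs its sender one transmission, and hop distances of 1, 1, 2, 3, 2 for nodes 1 to 5; `broadcast_tree_cost` and its argument layout are hypothetical names introduced only for this illustration.

```python
def broadcast_tree_cost(broadcasters, compressions, dist, eps, h=1.0):
    """Total cost of a compression tree under the broadcast (WL) model.

    broadcasters: nodes that locally broadcast their raw value (cost h each).
    compressions: (child, parent, node_where_compressed) triples; the compressed
                  value (eps bits) travels from that node to the base station.
    """
    cost = len(broadcasters) * h
    for _child, _parent, at in compressions:
        cost += dist[at] * eps
    return cost

eps = 0.01
dist_to_bs = {1: 1, 2: 1, 3: 2, 4: 3, 5: 2}
# Nodes 1 and 4 broadcast; X_2, X_3, X_5 are compressed locally given X_1,
# and X_4 is compressed at node 3 given X_3. The base station hears node 1 directly.
compressions = [(2, 1, 2), (3, 1, 3), (5, 1, 5), (4, 3, 3)]
total = broadcast_tree_cost({1, 4}, compressions, dist_to_bs, eps)
```

The result is $2 + 7\epsilon$, which agrees with the cost listed for Figure 1(iv).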

2.4 Compression Quality of a Solution 

To analyze and compare the quality of the solutions with the DSC approach, we subdivide the total commu- 
nication cost incurred by a data collection approach into two parts: 

Necessary Communication (NC): As discussed above, for practical reasons, data collection schemes typ- 
ically use a subset of the correlations present in the data (e.g. Cluster only uses intra-cluster correlations, 
our approach only uses second-order joint distributions). Given the specific set of correlations utilized by 
an approach, there is a minimum amount of communication that will be incurred during data collection. 
This cost is obtained by computing the DSC cost assuming only those correlations are present in the data. 
For a specific compression tree, the NC cost is computed as: 

$H(X_r) \times d(r, BS) + \sum_{i \in V \setminus \{r\}} H(X_i \mid X_{p(i)}) \times d(i, BS)$

The NC cost for the Cluster solution shown in Figure 1(ii) is $4 + 5\epsilon$, computed as:

$H(X_1) + H(X_2) + 2 \cdot H(X_5 \mid X_2) + 2 \cdot H(X_3) + 3 \cdot H(X_4 \mid X_3)$

In some sense, NC cost measures the penalty of ignoring some of the correlations during compression. 
For Cluster, this is typically quite high; compare to the NC cost for DSC ($= 1 + 8\epsilon$). On the other
hand, the NC cost for the solution in Figure 1(iv) is $1 + 8\epsilon$ (i.e., it is equal to the NC cost of DSC;
we note that this is an artifact of having uniform conditional entropies, and does not always hold).

Intra-source Communication (IC): This measures the cost of explicitly gathering the data together as
required for joint compression. By definition, this cost is 0 for DSC. We compute this by subtracting the
NC cost from the total cost. For the solutions presented in Figures 1(ii) and (iv), the IC cost is
$2 - 2\epsilon$ and $1 - \epsilon$ respectively. The broadcast communication model significantly helps in
reducing this cost for our approach.
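
The NC/IC split for the two solutions discussed above can be checked numerically. This is a sketch under the uniform assumption ($H(X_i) = 1$, conditional entropies $\epsilon$) with the hop distances of the running example; the `parent` maps and the function name `nc_cost` are illustrative, and the total costs are taken from the figure rather than recomputed.

```python
def nc_cost(parent, dist, eps, h=1.0):
    """Necessary communication: a node with no parent sends h bits,
    a node compressed against its parent sends eps bits, each over d(i, BS) hops."""
    return sum(dist[i] * (h if p is None else eps) for i, p in parent.items())

eps = 0.01
dist_to_bs = {1: 1, 2: 1, 3: 2, 4: 3, 5: 2}

# Compression tree of Figure 1(iv): 1 -> 2, 1 -> 3, 1 -> 5, 3 -> 4.
tree_parent = {1: None, 2: 1, 3: 1, 5: 1, 4: 3}
# Cluster solution of Figure 1(ii): clusters {1}, {2, 5}, {3, 4}.
cluster_parent = {1: None, 2: None, 5: 2, 3: None, 4: 3}

nc_tree = nc_cost(tree_parent, dist_to_bs, eps)
nc_cluster = nc_cost(cluster_parent, dist_to_bs, eps)

# IC is the total cost minus NC; totals are those listed for Figure 1.
ic_tree = (2 + 7 * eps) - nc_tree
ic_cluster = (6 + 3 * eps) - nc_cluster
```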
The key advantage of our compression tree-based approach is that its NC cost is usually quite close to DSC,
whereas the other approaches, such as Cluster, can have very high NC costs because they ignore a large 
portion of the correlations. 



2.5 Solution Space 

In our optimization algorithms, we consider searching among two different classes of compression trees. 

• Subgraphs of $G_C$ (SG): Here we require that the compression tree be a subgraph of the communication
graph. In other words, we compress Xi using Xj only if i and j are neighbors. 

• No restrictions (NS): Here we don't put any restrictions on the compression trees. As expected, 
searching through this solution space is much harder than SG. 

In general, we expect to find the optimal solution in the SG solution space; this is because the correlations 
are likely to be stronger among neighboring sensor nodes than among sensor nodes that are far away from 
each other. 

Finally, we define $\beta$ as the bounded conditional entropy parameter, which bounds the ratio of conditional
entropies for any pair of variables that can be used to compress each other. Formally,
$\frac{1}{\beta} \le \frac{H(X_i \mid X_j)}{H(X_j \mid X_i)} \le \beta$ for any nodes $i$ and $j$ and some constant
$\beta \ge 1$. For the SG problem, this is taken over pairs of adjacent nodes, and for the NS problem, it is
taken over all pairs. Moreover, the above property implies that the ratio of entropies between any pair of
nodes is also bounded: $\frac{1}{\beta} \le \frac{H(X_i)}{H(X_j)} \le \beta$.

We expect $\beta$ to be quite small ($\approx 1$) in most cases (especially if we restrict our search space
to SG). Note that, if the entropies are uniform ($H(X_i) = H(X_j)$), then $\beta = 1$.
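
As a small illustration of the definition above, the following sketch computes the smallest $\beta$ satisfying the ratio bound over a given set of pairs. The function name, the table layout, and the numeric values are assumptions for illustration only; the sketch also assumes all conditional entropies are strictly positive.

```python
def bounded_entropy_parameter(cond_entropy, pairs):
    """Smallest beta >= 1 with 1/beta <= H(X_i|X_j) / H(X_j|X_i) <= beta on all pairs.

    cond_entropy[(i, j)] holds H(X_i | X_j); values are assumed strictly positive.
    """
    beta = 1.0
    for i, j in pairs:
        ratio = cond_entropy[(i, j)] / cond_entropy[(j, i)]
        beta = max(beta, ratio, 1.0 / ratio)
    return beta

# Hypothetical conditional entropies for two adjacent pairs (SG case).
cond = {(1, 2): 0.10, (2, 1): 0.08, (2, 3): 0.05, (3, 2): 0.05}
beta_sg = bounded_entropy_parameter(cond, [(1, 2), (2, 3)])

# Uniform conditional entropies give beta = 1.
uniform = {(1, 2): 0.05, (2, 1): 0.05}
beta_uniform = bounded_entropy_parameter(uniform, [(1, 2)])
```

For the SG problem the pairs would be the edges of the communication graph; for NS they would range over all pairs of nodes.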

2.6 Summary of Our Results 

Combining the distinct communication models and different solution spaces, we get four different problems 
that we consider in this paper: (1) WL-SG, (2) WL-NS, (3) Multicast-NS, and (4) Unicast (which subsumes 
Multicast-SG). We summarize the results as follows. 

1. (Section 3.1) We first consider the WL-SG problem under a uniform entropy and conditional entropy
assumption, i.e., we assume that $H(X_i) = 1\ \forall i$ and $H(X_i \mid X_j) = \epsilon\ \forall i, j,\ i \ne j$.
We develop a $\left(\frac{1}{1 + 2\epsilon(d_{avg} - 1/2)}(H_\Delta + 1) + 2\right)$-approximation for this
problem, where $d_{avg}$ is the average distance to the base station.

2. (Sections 3.2 and 3.3) We develop a unified generic greedy framework which can be used for
approximating the problem under various communication cost models.

3. (Sections 3.4 and 3.5) We show that, for the wireless communication model, the greedy framework gives
a $4\beta^2 H_n$ approximation factor for the SG solution space and an $O(\beta^3 n^\epsilon \log n)$ (for any
$\epsilon > 0$) factor for the NS solution space.

4. (Section 3.6) For the multicast-NS problem, we show that the greedy framework gives an O(— (log n)^{3+ε})
(for any ε > 0) approximation.

5. (Section 4) For the unicast communication model, we present a simple poly-time algorithm for finding
an optimal restricted solution (defined in Section 3.2), giving us a $(2 + \beta)$-approximation. Further,
we show that the optimal restricted solution is also the optimal solution under the uniform entropy and
conditional entropy assumption.







6. (Section 5) We illustrate through an empirical evaluation that our approach usually leads to very good
data collection schemes in presence of strong correlations. In many cases, the solution found by our 
approach performs nearly as well as the theoretical lower bound given by DSC. 

3 Approximation Algorithms 

We first present an approximation algorithm for the WL-SG problem under the uniform entropy assumption; 
this will help us tie the problem with some previously studied graph problems, and will also form the basis 
for our main algorithms. We then present a generic greedy framework that we use to derive approximation 
algorithms for the remaining problems. 

3.1 The WL-SG Model: Uniform Entropy and Conditional Entropy Assumption 

Without loss of generality, we assume that $H(X_i) = 1,\ \forall i$, and $H(X_i \mid X_j) = \epsilon,\ \forall i, j$,
for all adjacent pairs of nodes $(X_i, X_j)$. We expect that typically $\epsilon \ll 1$.

For any compression tree that satisfies the SG property, the data movement scheme must have a subset of
the sensor nodes locally broadcast their (compressed) values, such that for every edge $(u, v)$ in the
compression tree, either $u$ or $v$ (or both) broadcast their values. (If this is not true, then it is not
possible to compress $X_v$ using $X_u$.) Let $S$ denote this subset of nodes. Each of the remaining nodes
only transmits $\epsilon$ bits of information.

To ensure that the base station can reconstruct all the values, $S$ must further satisfy the following
properties: (1) $S$ must form a dominating set of $G_C$ (any node $\notin S$ must have a neighbor in $S$).
(2) The graph formed by deleting all edges $(x, y)$ where $x, y \in V \setminus S$ is connected. Property (1)
implies every node should get at least one of its neighbors' messages for compression, and property (2)
guarantees the connectedness of the compression tree given that $S$ broadcasts. Graph-theoretically, this
leads to a slightly different problem than both the classical Dominating Set (DS) and Connected Dominating
Set (CDS) problems [13]. Specifically, $S$ must be a Weakly Connected Dominating Set (WCDS) [3] of $G_C$.

In the network shown in Figure 2, nodes 3, 4, 9 and 10 form a WCDS, and thus locally broadcasting their
values can give us a valid compression tree (shown in Figure 2(ii)). However, note that nodes 2, 4, 9 and 10
form a DS but not a WCDS. As a result, we cannot form a compression tree with these nodes performing
local broadcasts (there would be no way to reconstruct the value of both $X_3$ and $X_2$).
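
Properties (1) and (2) above translate directly into a check for whether a candidate broadcast set is a WCDS. The sketch below is illustrative only: the function names are assumptions, and the test graph is a hypothetical 5-node path rather than the network of Figure 2.

```python
from collections import deque

def is_dominating(adj, s):
    """Property (1): every node is in s or has a neighbor in s."""
    return all(v in s or any(u in s for u in adj[v]) for v in adj)

def is_weakly_connected(adj, s):
    """Property (2): the graph stays connected after deleting every edge
    whose two endpoints both lie outside s."""
    nodes = list(adj)
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if (u in s or v in s) and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)

def is_wcds(adj, s):
    return is_dominating(adj, s) and is_weakly_connected(adj, s)

# Hypothetical path network 1 - 2 - 3 - 4 - 5.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
good = is_wcds(path, {2, 4})                   # dominating and weakly connected
ds_only = is_dominating(path, {1, 4})          # dominating ...
ds_not_wcds = not is_wcds(path, {1, 4})        # ... but edge 2 - 3 is deleted
```

On the path, the set {1, 4} dominates every node but leaves nodes 2 and 3 joined only by an edge between two non-broadcasting nodes, so it fails property (2), mirroring the DS-but-not-WCDS situation described for Figure 2.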

The approach for the CDS problem that gives a $2H_\Delta$ approximation [13] gives an $H_\Delta + 1$
approximation³ for WCDS [3]. We use this to prove that:

Theorem 1 Let the average distance to the base station be $d_{avg} = \frac{1}{n} \sum_{i} d(i, BS)$. The
approximation for WCDS yields a $\left(\frac{1}{1 + 2\epsilon(d_{avg} - 1/2)}(H_\Delta + 1) + 2\right)$-approximation
for the WL-SG problem under the uniform entropy and conditional entropy assumption.

Proof: The amount of data broadcast is clearly $|S|$ (since $H(X_i) = 1$ for $i \in S$). Each non-broadcast
node $j$ sends $\epsilon$ amount of data to BS; the cost of this is $\epsilon\, d(j, BS)$. For each broadcast
node $j$, $\epsilon$ amount of data may be sent from $p(j)$ and the cost is at most $\epsilon(d(j, BS) + 1)$ and
at least $\epsilon(d(j, BS) - 1)$. The total communication cost is thus at most
$UB = |S| + \epsilon\left(\sum_j d(j, BS) + |S|\right)$. In an optimal solution, suppose $S^*$ denotes the set of
nodes that perform local broadcast; then the lower bound on the total cost is:

$LB = |S^*| + \epsilon\left(\sum_j d(j, BS) - |S^*|\right)$.

We can also easily see $|S^*| \le n/2$, thus $|S^*| \le \frac{LB}{1 + 2\epsilon(d_{avg} - 1/2)}$.
Therefore, using $|S| \le (H_\Delta + 1)|S^*|$ and $\epsilon |S| \le \epsilon \sum_j d(j, BS) \le LB$ (which holds
when $\epsilon \le 1$ and every node is at least one hop from the base station),

$UB \le (H_\Delta + 1)|S^*| + 2\, LB \le \left(\frac{1}{1 + 2\epsilon(d_{avg} - 1/2)}(H_\Delta + 1) + 2\right) LB$.

3 $\Delta$ is the maximum degree and $H_n$ is the $n$th harmonic number, i.e., $H_n = \sum_{i=1}^{n} \frac{1}{i}$.

Figure 2: (i) A weakly connected dominating set of the sensor network is indicated by the shaded
nodes, which locally broadcast their values; (ii) The corresponding compression tree (e.g., Node 3 is
compressed using the value of Node 1 at Node 1, whereas Node 5 is compressed using the value of
Node 4 at Node 5).

Figure 3: Illustrating the Treestar algorithm: First the treestars centered at nodes 10, 9 and 3 are
chosen, and finally the treestar centered at node 4 is chosen. This causes the parents of nodes 1 and 5
to be re-defined as node 4, the parent of node 9 to be defined as node 5, and the parent of node 3 to
be defined as node 1. (i) also shows an extended compression tree.
From the above theorem, if $\epsilon$ is small enough, say $\epsilon = o(1/d_{avg})$, the approximation ratio is
approximately $H_\Delta$. On the other hand, if $\epsilon$ is large, the approximation ratio becomes better.
Specifically, if $\epsilon \approx H_\Delta / d_{avg}$, then we get a constant approximation. This matches our
intuition that the hardness of approximation comes mainly from the case when the correlations are very
strong. We can further formalize this: by a standard reduction from the set cover problem, which is hard to
approximate within a factor of $(1 - \delta) \ln n$ for any $\delta > 0$ [10], we can prove:

Theorem 2 The WL-SG problem cannot be approximated within a factor of $(1 - \delta) \ln n$ for any
$\delta > 0$, even with uniform entropy and conditional entropy, unless
$NP \subseteq DTIME(n^{O(\log \log n)})$.
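
The dependence of the Theorem 1 ratio on $\epsilon$ can be explored numerically. The sketch below assumes the ratio reads $\frac{H_\Delta + 1}{1 + 2\epsilon(d_{avg} - 1/2)} + 2$; the function names and the sample values of the maximum degree and $d_{avg}$ are hypothetical.

```python
def harmonic(n: int) -> float:
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def wlsg_ratio_bound(max_degree: int, eps: float, d_avg: float) -> float:
    """Assumed form of the Theorem 1 ratio for the WL-SG problem."""
    return (harmonic(max_degree) + 1.0) / (1.0 + 2.0 * eps * (d_avg - 0.5)) + 2.0

# Hypothetical network parameters.
max_degree, d_avg = 8, 10.0
h_delta = harmonic(max_degree)

ratio_strong = wlsg_ratio_bound(max_degree, 0.0, d_avg)              # eps -> 0
ratio_moderate = wlsg_ratio_bound(max_degree, h_delta / d_avg, d_avg)  # eps ~ H_Delta / d_avg
```

As $\epsilon \to 0$ the bound approaches $H_\Delta + 3$, so it grows with the maximum degree, whereas at $\epsilon \approx H_\Delta / d_{avg}$ it drops below 3 for these parameters, in line with the discussion above.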



3.2 The Generic Greedy Framework 

We next present a generic greedy framework that helps us analyze the rest of the problems. 

Suppose node $p(i)$ is the parent of node $i$ in the compression tree $T$. Let $I_{i,p(i)}$ denote the node
where $X_i$ is compressed using $X_{p(i)}$. We note that this is not required to be $i$ or $p(i)$, and could be
any node in the network. This makes the analysis of the algorithms very hard. Hence we focus on the set of
feasible solutions of the following restricted form: $I_{i,j}$ is either node $i$ or $j$. The following lemma
states that the cost of the optimal restricted solution is close to the optimal cost.

Lemma 1 Let the optimal solution be $OPT$ and the optimal restricted solution be $\widetilde{OPT}$. We have
$cost(\widetilde{OPT}) \le (2 + \beta)\, cost(OPT)$. Furthermore, for the WL-SG model,
$cost(\widetilde{OPT}) \le 2\, cost(OPT)$.
Proof: Let $T^*$ be the compression tree of $OPT$. We keep the compression tree unchanged and only
modify the data movement scheme of $OPT$ to construct a restricted solution $\widetilde{OPT}$ whose cost is at
most $(2 + \beta)\, cost(OPT)$. Assume that $i$ is the parent of $j$ in $T^*$ and $X_j \mid X_i$ is computed at
some node $I_{i,j}$. We denote by $T_i$ the set of nodes which receive $X_i$ from $i$. We simply extend $T_i$ to
be $\widetilde{T}_i = T_i \cup P(I_{i,j}, j)$, where $P(u, v)$ is the shortest path from $u$ to $v$. Then
$X_j \mid X_i$ is computed on node $j$ and then sent to the base station.

Now, we analyze the increase in cost for the wired network model. The proof for the wireless network
case is almost the same and we omit it here. Let $p(i)$ be the parent and $Ch(i)$ be the set of children of
node $i$ in $T^*$.

$$cost(OPT) = \sum_{i \in T^*} H(X_i)\, c(T_i) + \sum_{i \in T^* \setminus \{BS\}} H(X_i \mid X_{p(i)})\, d(I_{i,p(i)}, BS).$$

Thus, we have:

$$\begin{aligned}
cost(\widetilde{OPT}) &= \sum_{i \in T^*} H(X_i)\, c(\widetilde{T}_i) + \sum_{i \in T^* \setminus \{BS\}} H(X_i \mid X_{p(i)})\, d(i, BS) \\
&\le \sum_{i \in T^*} H(X_i) \Big( c(T_i) + \sum_{j \in Ch(i)} d(I_{j,p(j)}, j) \Big) + \sum_{i \in T^* \setminus \{BS\}} H(X_i \mid X_{p(i)}) \big( d(I_{i,p(i)}, i) + d(I_{i,p(i)}, BS) \big) \\
&\le cost(OPT) + \beta \sum_{i \in T^*} H(X_i)\, d(I_{i,p(i)}, i) + \sum_{i \in T^*} H(X_i \mid X_{p(i)})\, d(I_{i,p(i)}, i) \\
&\le cost(OPT) + (1 + \beta) \sum_{i \in T^*} H(X_i)\, c(T_i) \le (2 + \beta)\, cost(OPT).
\end{aligned}$$

For the WL-SG model, the only reason that $I_{i,p(i)}$ is neither $i$ nor $p(i)$ is that both $i$ and $p(i)$
broadcast their values to a third node $I_{i,p(i)}$ which is closer to the base station. The above analysis
still carries over, except that we don't need any extra intra-source communication. Then, we don't have the
$\beta$ term in the formula and it gives us a ratio of 2.

Our algorithm finds what we call an extended compression tree, which in a final step is converted to a 
compression tree. An extended compression tree $\vec{T}$ corresponding to a compression tree $T$ has the same 
underlying tree structure, but each edge $e(i,j) \in \vec{T}$ is associated with an orientation specifying the raw 
data movement. Basically, an extended compression tree naturally suggests a restricted solution in which 
an edge from $i$ to $j$ in $\vec{T}$ implies that $i$ ships its raw data to $j$ and the corresponding compression is carried 
out at $j$. We note that the direction of the edges in $\vec{T}$ may not be the same as in $T$, where edges are always 
oriented from the root to the leaves, irrespective of the data movement. In the following, the parent 
of node $i$, denoted by $p(i)$, always refers to its parent in $T$, i.e., the node one hop closer to the root. 

The main algorithm constructs an extended compression tree by greedily choosing subtrees 
to merge in iterations. We start with an empty graph $\mathcal{F}_1$ that consists of only isolated nodes. During the 
execution, we maintain a forest in which each edge is directed. In each iteration, we combine some trees 
together into a new larger tree by choosing the most (or approximately most) cost-effective treestar (defined later). 
Let the forest at the start of the $i$th iteration be $\mathcal{F}_i$. A treestar $TS$ is specified by $k$ trees in $\mathcal{F}_i$, say $T_1, \ldots, T_k$, 
a node $r \notin T_j$ ($1 \le j \le k$), and $k$ directed edges $e_j = (r, v_j)$ ($v_j \in T_j$, $1 \le j \le k$). We call $r$ the center, 
$T_1, \ldots, T_k$ the leaf-trees, and $e_j$ the leaf-edges. The treestar $TS$ is a specification of the data movement of $X_r$, 
which we will explain in detail shortly. Once a treestar is chosen, the corresponding data movement is added 
to our solution. The algorithm terminates when only one tree is left, which will be our extended compression 
tree $\vec{T}$. 

Let $r$ be the center of $TS$ and $S$ be the set of indices of its leaf-trees. We define the cost of $TS$, $cost(TS)$, 
to be 

$$cost(TS) = \min_{v_j \in T_j,\, j \in S} \Big( c(r, \{v_j\}_{j \in S})\, H(X_r) + \sum_{j \in S} H(X_{v_j}|X_r)\, d(v_j, BS) \Big)$$

where $c(r, \{v_j\}_{j \in S})$ is the minimum cost for sending $X_r$ from $r$ to all $v_j$'s. Essentially, the first term 
corresponds roughly to the cost of intra-source communication (raw data movement of $X_r$), denoted $IC(TS)$, 
and the second roughly to the necessary communication (conditional data movement), denoted $NC(TS)$. 
We say that the corresponding data movement is an implementation of the treestar. The cost function $c()$ 
differs for different cost models of the problem; we will specify its concrete form later. 

We define the cost effectiveness of the treestar $TS$ to be $ceff(TS) = \frac{cost(TS)}{k}$, where $k$ is the number of 
leaf-trees in $TS$. In each iteration, we will try to find the most cost-effective treestar. Let Mce-Treestar($\mathcal{F}_i$) 
be the procedure for finding the most (or approximately most) cost-effective treestar on $\mathcal{F}_i$. The actual 
implementation of the procedure Mce-Treestar will be described in detail in the discussion of each cost model. In 
some cases, finding the most cost-effective treestar is NP-hard and we can only approximate it. 
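
The iterative merging described above can be sketched as follows. This is only an illustrative skeleton under stated assumptions, not the authors' implementation: the names `greedy_framework` and `cheapest_pair_oracle` are hypothetical, the oracle passed in plays the role of Mce-Treestar, and the toy oracle shown only considers treestars with a single leaf-tree.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Node = Hashable
# (center r, leaf nodes v_j taken from distinct other trees, cost(TS))
Treestar = Tuple[Node, List[Node], float]


def greedy_framework(nodes: List[Node],
                     mce_treestar: Callable[[Dict[Node, Node]], Treestar]):
    """Merge trees via treestars until a single tree remains.

    Returns the directed raw-data edges (r, v_j) and the summed treestar cost.
    """
    comp = {v: v for v in nodes}      # tree label of every node (isolated nodes)
    edges: List[Tuple[Node, Node]] = []
    total = 0.0
    while len(set(comp.values())) > 1:
        r, leaves, cost = mce_treestar(comp)
        leaf_labels = {comp[v] for v in leaves}
        # leaf-trees must be distinct and must not contain the center
        assert comp[r] not in leaf_labels and len(leaf_labels) == len(leaves)
        edges.extend((r, v) for v in leaves)
        total += cost
        for v in nodes:               # merge the leaf-trees into r's tree
            if comp[v] in leaf_labels:
                comp[v] = comp[r]
    return edges, total


def cheapest_pair_oracle(weights: Dict[Tuple[Node, Node], float]):
    """Toy stand-in for Mce-Treestar: single-leaf treestars only (k = 1)."""
    def oracle(comp):
        best = None
        for (r, v), w in weights.items():
            if comp[r] != comp[v] and (best is None or w < best[2]):
                best = (r, [v], w)
        return best
    return oracle
```

For example, `greedy_framework([0, 1, 2], cheapest_pair_oracle({(0, 1): 1.0, (1, 2): 2.0, (0, 2): 5.0}))` first joins node 1 to node 0 and then joins node 2 through node 1. A model-specific oracle would replace the toy one without changing the loop.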

We now discuss the final data movement scheme and how the cost of the final solution has been properly 
accounted for in the treestars that were chosen. Suppose in some iteration, a treestar $TS$ is chosen in which the 
center node $r$ sends its raw information to each $v_j$ ($v_j \in T_j$, $j \in S$), where $S$ is the set of indices of leaf-trees in 
$TS$. The definition of the cost function suggests that $X_{v_j}$ is compressed using $X_r$ at $v_j$, and the result is sent 
from $v_j$ to $BS$. However, this may not be consistent with the extended compression tree $\vec{T}$. In other words, 
due to later treestars being chosen, some $v_j$ may become the parent of $r$ in $\vec{T}$, which implies that $X_r$ 
should be compressed using $X_{v_j}$ instead of the other way around. Suppose some leaf $v_p$ ($v_p \in T_p$, $p \in S$) is 
the parent of $r$ in $\vec{T}$. The actual data movement scheme is determined as follows. We keep the raw data 
movement induced by $TS$ unchanged, i.e., $r$ still sends $X_r$ to each $v_j$ ($j \in S$). But now, $X_r|X_{v_p}$ instead of 
$X_{v_p}|X_r$ is computed on node $v_p$ and sent to the base station. Other leaves $v_j$ ($j \ne p$) still compute and send 
$X_{v_j}|X_r$. It is easy to check that this data movement scheme actually implements the extended compression tree 
$\vec{T}$. 
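
The rule above (raw data always flows along the chosen edge, and the node that receives it compresses whichever of the two values is the child's) can be illustrated with a short sketch. The function name `conditional_messages` and the string encoding of messages are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def conditional_messages(raw_edges: List[Tuple[int, int]],
                         parent: Dict[int, int]) -> List[Tuple[int, str]]:
    """For each raw-data edge (i, j), meaning i ships X_i to j, return the
    pair (j, message) describing what node j computes and sends to BS.

    `parent` maps every non-root node to its parent in the compression tree.
    """
    messages = []
    for i, j in raw_edges:
        if parent.get(j) == i:        # i is the parent: compress the child X_j
            messages.append((j, f"X{j}|X{i}"))
        elif parent.get(i) == j:      # j is the parent: compress the child X_i
            messages.append((j, f"X{i}|X{j}"))
        else:
            raise ValueError(f"edge {(i, j)} is not a compression-tree edge")
    return messages
```

With raw-data edges 3 -> 1 and 4 -> 1, and a compression tree in which 4 is the parent of 1 and 1 is the parent of 3, node 1 ends up sending both X3|X1 and X1|X4.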

For instance, in the figure, node 3 is initially the parent of node 1, but later node 4 becomes the parent of 
node 1, and in fact node 1 ships $X_1|X_4$ to the base station (and not $X_1|X_3$). Node 1, now being the parent 
of node 3, also compresses $X_3$ and sends $X_3|X_1$ to $BS$. Due to the fact that $\frac{1}{\beta} \le \frac{H(X_i|X_j)}{H(X_j|X_i)} \le \beta$, the actual 
data movement cost is at most $\beta$ times the sum of the treestar costs. Thus every part of the communication 
cost incurred is counted in some treestar. We formalize the above observations as the following lemma: 

Lemma 2 Let $TS_i$ be the treestar we choose in iteration $i$ for $1 \le i \le \ell$. Then $cost(\vec{T}) \le \beta \sum_{i=1}^{\ell} cost(TS_i)$. 

The pseudocode for constructing $\vec{T}$ and the corresponding communication scheme is given in Algorithm 1. 

3.3 The Generic Analysis Framework 

Let $\mathcal{F}_i$ be the forest of $n_i$ trees before iteration $i$ and $\vec{T}$ be the final extended compression tree. $OPT$ is 
defined as the optimal restricted solution and $OPT_i$ as the optimal solution for the following problem: Find 
an extended compression tree $\vec{T}$ that contains $\mathcal{F}_i$ as a subgraph such that the cost for implementing all treestars 
in $\vec{T} - \mathcal{F}_i$ is minimized. Clearly, $OPT_1 = OPT$. Let $TS_i$ be the treestar computed in iteration $i$, with $m_i$ 
tree components (the number of leaf-trees of $TS_i$ plus one). After $\ell$ iterations (it is easy to see that $\ell$ must be 
smaller than $n$), the algorithm terminates. It is easy to see that $n_{i+1} = n_i - m_i + 1$ for $i = 1, \ldots, \ell$. We 
assume Mce-Treestar is guaranteed to find an $\alpha$-approximate most cost-effective treestar. 

Algorithm 1: The Generic Greedy Framework 

    $\mathcal{F}_1 = \bigcup_{v} \{\{v\}\}$; $i = 1$; 
    while $\mathcal{F}_i$ is not a spanning tree do 
        $TS_i$ = Mce-Treestar($\mathcal{F}_i$); 
        Let $E(TS_i)$ be the tree-edges of $TS_i$ and $r$ be the center of $TS_i$; 
        $\mathcal{F}_{i+1} = \mathcal{F}_i + E(TS_i)$; 
        $T_r = T_r + IC(TS_i)$; 
        $i = i + 1$; 
    $\vec{T} = \mathcal{F}_i$; 
    for each directed edge $e(i,j) \in E(\vec{T})$ do 
        if $i$ is the parent of $j$ then 
            Compute $X_j|X_i$ at $j$ and send it to $BS$; 
        else 
            Compute $X_i|X_j$ at $j$ and send it to $BS$; 


Lemma 3 For all $i \ge 1$, $\frac{cost(TS_i)}{m_i} \le \alpha\beta\, \frac{cost(OPT_i)}{n_i}$. 

Proof: Suppose the extended compression tree for $OPT_i$ is $\vec{T}_i$, which has $\mathcal{F}_i$ as a subgraph. $OPT_i$ consists of 
all data movement which implements all treestars defined by $\vec{T}_i - \mathcal{F}_i$. These treestars, say $TS^*_1, TS^*_2, \ldots$, 
correspond to edge-disjoint stars in $\vec{T}_i$. Suppose $TS^*_j$ connects $m_j$ tree components. Since each tree 
component of $\mathcal{F}_i$ is involved in some $TS^*_j$, we can see that $\sum_j m_j \ge n_i$. By the fact that $TS_i$ is an $\alpha$-approximation 
of the most cost-effective treestar, we get 

$$\frac{cost(TS_i)}{m_i} \le \alpha \min_j \left\{ \frac{cost(TS^*_j)}{m_j} \right\} \le \alpha\, \frac{\sum_j cost(TS^*_j)}{\sum_j m_j} \le \alpha\beta\, \frac{cost(OPT_i)}{n_i}.$$




The proof of the following lemma is omitted. 

Lemma 4 $cost(OPT_i) \le cost(OPT)$. 

We are now ready to prove our main theorem. 

Theorem 3 Assuming we can compute an $\alpha$-approximation of the most cost-effective treestar and the 
bounded conditional entropy parameter is $\beta$, there is a $2\alpha\beta^2 H_n$-approximate restricted solution. 

Proof: The cost of our solution $SOL$ is: 

$$cost(SOL) \le \beta \sum_{i=1}^{\ell} cost(TS_i) \le \beta \sum_{i=1}^{\ell} \alpha\beta\, \frac{cost(OPT_i)\, m_i}{n_i} \le \alpha\beta^2 \cdot cost(OPT) \cdot \sum_{i=1}^{\ell} \frac{m_i}{n_i} \le 2\alpha\beta^2 H_n\, cost(OPT).$$

The last inequality holds since $\sum_{i=1}^{\ell} \frac{m_i}{n_i} \le 2 \sum_{i=1}^{\ell} \frac{m_i - 1}{n_i} \le 2 H_n$. 

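
For completeness, the harmonic-sum bound can be expanded as follows. This is a worked sketch that uses only the facts stated in the text: $m_i \ge 2$ (every treestar merges at least two components), $n_{i+1} = n_i - m_i + 1$, $n_1 = n$ and $n_{\ell+1} = 1$.

```latex
\begin{align*}
\sum_{i=1}^{\ell} \frac{m_i}{n_i}
  &\le 2 \sum_{i=1}^{\ell} \frac{m_i - 1}{n_i}
     && \text{since } m_i \ge 2 \text{ implies } m_i \le 2(m_i - 1) \\
  &= 2 \sum_{i=1}^{\ell} \frac{n_i - n_{i+1}}{n_i}
     && \text{since } n_{i+1} = n_i - m_i + 1 \\
  &\le 2 \sum_{i=1}^{\ell} \sum_{t = n_{i+1}+1}^{n_i} \frac{1}{t}
     && \text{since } \tfrac{1}{n_i} \le \tfrac{1}{t} \text{ for } t \le n_i \\
  &= 2 \left( H_{n_1} - H_{n_{\ell+1}} \right)
   = 2 \left( H_n - H_1 \right) \le 2 H_n .
\end{align*}
```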
3.4 The WL-SG Model 

We first specify the cost function $c(r, \{v_j\}_{j \in S})$ in the wireless sensor network model where we require the 
compression tree to be a subgraph of the communication graph, and then give a polynomial time algorithm 
for finding the most cost-effective treestar. 

Recall that $c(r, \{v_j\}_{j \in S})$ is the cost of sending $X_r$ from $r$ to all $v_j$'s. It is easy to see that $c(r, \{v_j\}_{j \in S}) = H(X_r)$, 
since we require each $v_j$ to be adjacent to $r$ and a single broadcast of $X_r$ from $r$ can accomplish the 
communication. The most cost-effective treestar can be computed as follows: We fix a node $r$ as the center to 
which all leaf-edges will connect. Assume $T_1, T_2, \ldots$ are sorted in non-decreasing order of $h(r, T_j) = 
\min_{v \in T_j \cap N(r)} H(X_v|X_r)\, d(v, BS)$. Here $h(r, T_j)$ captures the minimum cost of sending the data of some node 
in $T_j \cap N(r)$ conditioned on $X_r$ to the base station. The most cost-effective treestar with center $r$ is then 
determined simply by the best prefix $T_1, \ldots, T_k$ of this order. 


We briefly analyze the running time of the algorithm. In the pre-processing step, we need to compute 
$d(v, BS)$ for all $v$ by running the single source shortest path algorithm, which takes $O(n^2)$ time. In each 
iteration, for each candidate center $r$, sorting the $h(r, T_j)$'s needs $O(deg(r) \log deg(r))$ time. So, the most 
cost-effective treestar can be found in $O(|E| \log n)$ time. Since in each iteration we merge at least two tree 
components, there are at most $n$ iterations. Therefore, the total running time is $O(n|E| \log n)$. Using 
Lemma 1 and Theorem 3, we obtain the following. 

Theorem 4 We can compute an $H_n$-approximation for the WL-SG model in $O(n|E| \log n)$ time. 

3.5 The WL-NS Model 

Here we don't put any restrictions on the compression trees. Thus, a source node is able to send the message 
to a set of nodes through a Steiner tree, and the cost for sending one bit is the sum of the weights of all 
inner nodes of the Steiner tree (due to the broadcasting nature of wireless networks). In graph theoretic 
terminology, it is the cost of the connected dominating set that includes the source node and dominates all 
terminals. Formally, the cost of the treestar $TS$ with node $r$ as the center and $S$ as the set of indices of the 
leaf-trees is defined to be: 

$$cost(TS) = \min_{v_j \in T_j,\, j \in S} \Big( Cds(r, \{v_j\}_{j \in S})\, H(X_r) + \sum_{j \in S} H(X_{v_j}|X_r)\, d(v_j, BS) \Big)$$

where $Cds()$ is the minimum connected set dominating all nodes in its argument. 

Next, we discuss how to find the most cost-effective treestar. We reduce the problem to the following version 
of the directed Steiner tree problem [2]. 

Definition 1 Given a weighted directed graph $G$, a specified root $r \in V(G)$, an integer $k$ and a set $X \subseteq V$ 
of terminals, the D-Steiner($k, r, X$) problem asks for a minimum weight directed tree rooted at $r$ that can 
reach any $k$ terminals in $X$. 

It has been shown that the D-Steiner($k, r, X$) problem can be approximated within a factor of $O(n^{\epsilon})$ for any 
fixed $\epsilon > 0$ within time $O(n^{O(1/\epsilon)})$ [2]. 

The reduction is as follows. We first fix the center $r$. Then, we create an undirected node-weighted 
graph $D$. The weight of each node is $H(X_r)$. For each node $v$, we create a copy $v'$ with weight $w(v') = 
H(X_v|X_r)\, d(v, BS)$ and add an edge $(u, v')$ for each $u \in N(v)$. For each tree component $T_j$, we create a 
group $g_j = \{v' \mid v \in T_j\}$. Then, we construct a directed edge-weighted graph from $D$. We replace each undirected 
edge with two directed edges of opposite directions. For each group $g_i$, we add one node $t_i$ and edges $(v, t_i)$ 
for all $v \in g_i$. The following standard trick transfers the weight on nodes to directed edges. For each 
vertex $v \in V(D)$, we replace it with a directed edge $(v', v'')$ with the same weight as $w(v)$ such that $v'$ 
absorbs all incoming edges of $v$ and $v''$ takes all outgoing edges of $v$. We let all $t_i$'s be the terminals we 
want to connect. It is easy to see that a directed Steiner tree connecting $k$ terminals in the new directed graph 
corresponds exactly to a treestar with $k$ leaf-trees. 
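
The node-splitting step mentioned above is a standard transformation and can be sketched in a few lines. The function name `split_node_weights` and the `("in"/"out")` tagging of the two halves of each vertex are hypothetical choices made for this illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple

Vertex = Tuple[Hashable, str]   # (original node, "in" or "out")


def split_node_weights(node_weight: Dict[Hashable, float],
                       arcs: Iterable[Tuple[Hashable, Hashable]]
                       ) -> Dict[Tuple[Vertex, Vertex], float]:
    """Turn a node-weighted digraph into an edge-weighted one.

    Every node v becomes an arc (v, "in") -> (v, "out") carrying w(v);
    every original arc u -> v becomes (u, "out") -> (v, "in") with weight 0.
    """
    weighted = {}
    for v, w in node_weight.items():
        weighted[((v, "in"), (v, "out"))] = w
    for u, v in arcs:
        weighted[((u, "out"), (v, "in"))] = 0.0
    return weighted
```

Any directed path through a node in the original graph must traverse that node's internal arc in the new graph, so path and tree weights are preserved.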

Theorem 5 We develop an $O(\beta^3 n^{\epsilon} \log n)$-approximation for the WL-NS model for any fixed constant $\epsilon > 0$ 
in $O(n^{O(1/\epsilon)})$ time. 

3.6 The Multicast-NS Model 

We consider the wired network model and do not require the compression tree to be a subgraph of $\mathcal{G}_C$. First, 
we need to provide the concrete form of the cost function $c(r, \{v_j\}_{j \in S})$, i.e., the cost of sending one unit of 
data from $r$ to all $v_j$'s. In this model, it is easy to see that the cost is that of the minimum Steiner tree connecting $r$ and 
all $v_j$'s. 

Suppose node $r$ is the center of the treestar $TS$ and $S$ is the set of the indices of leaf-trees. According 
to the communication model and the general cost definition, the cost of $TS$ here is defined as: 

$$cost(TS) = \min_{v_j \in T_j,\, j \in S} \Big( Stn(r, \{v_j\}_{j \in S})\, H(X_r) + \sum_{j \in S} H(X_{v_j}|X_r)\, d(v_j, BS) \Big)$$

where $Stn()$ is the minimum Steiner tree connecting all nodes in its argument. 

Next we show how to find the most cost-effective treestar. We first fix the center $r$. Basically, our task is 
to find a set $S$ of tree components such that $\frac{cost(TS)}{|S|}$ is minimized. We will convert this problem to a variant 
of the group Steiner tree problem. Actually, the following min-density variant has been considered in order 
to solve the general group Steiner tree problem [3]. 

Definition 2 Given an undirected graph $G$ and a collection of vertex subsets $\{g_i\}$, find a tree $T$ in $G$ such 
that $\frac{cost(T)}{|\{g_i \,:\, g_i \cap T \ne \emptyset\}|}$ is minimized. 

Our reduction to the min-density group Steiner problem works as follows. For each node $v$, we create a 
copy $v'$ and add an edge $(v, v')$ with weight $H(X_v|X_r)\, d(v, BS)$. For each tree component $T_j$, we define a 
group $g_j = \{v' \mid v \in T_j\}$. It is easy to see that the cost of the Steiner tree spanning a set of groups is exactly the 
cost of the corresponding treestar. 

The min-density group Steiner problem can be approximated within a factor of $O(\frac{1}{\epsilon} (\log n)^{2+\epsilon})$ for any 
constant $\epsilon > 0$ [3]. The running time is $O(n^{O(1/\epsilon)})$. By plugging this result into our greedy framework and 
Lemma 1, we obtain the following theorem. 

Theorem 6 There is an algorithm with an approximation factor of $O(\frac{\beta^3}{\epsilon} (\log n)^{3+\epsilon})$ for the WN-NS model 
for any fixed constant $\epsilon > 0$ in $O(n^{O(1/\epsilon)})$ time. 

4 The Unicast Model: Poly-Time Algorithm for Restricted Solutions 

We present a polynomial time algorithm for computing the optimal restricted solution under the unicast 
communication model, giving us a $(2+\beta)$-approximation by Lemma 1. Further, we can show that the 
algorithm will produce an optimal solution under the uniform entropy and conditional entropy assumption. 

Lemma 5 For the unicast model with the uniform entropy and conditional entropy assumption, there is always an 
optimal solution of the restricted form. 

Proof: We prove the lemma by modifying an optimal compression tree (and the associated data movement 
scheme) to a restricted solution without increasing the cost. Suppose $T$ is an optimal compression tree. We 
repeatedly process the following type of edge until none is left. Take an edge $(u, v) \in T$ such that $X_v|X_u$ 
is computed neither on node $u$ nor on node $v$ (we call it a bad edge). Assume it is computed on node $w$. Note 
that we will never change the raw data movement flow. We distinguish two cases: 

1. $w$ is not in the subtree rooted at $v$. The new compression tree $T'$ is formed by deleting $(u, v)$ from $T$ 
and adding $(w, v)$ to it. Instead of $X_v|X_u$, $X_v|X_w$ is computed on $w$. Everything else is kept unchanged. 
It is not hard to see that $T'$ is a valid compression tree and the new data movement scheme implements it. 

2. $w$ is in the subtree rooted at $v$. In this case, we delete $(u, v)$ from $T$ and add $(u, w)$ to obtain the new 
compression tree $T'$. Accordingly, all edges in the path from $v$ to $w$ need to change their directions. 
The data movement scheme is modified as follows. Instead of sending $X_v|X_u$, we send $X_w|X_u$ from 
$w$ ($w$ has $X_u$). For each edge $(x, y)$ in the path from $v$ to $w$ in $T$, we send $X_x|X_y$ instead of $X_y|X_x$ 
(at the same location). This modification corresponds to the change of the direction of $(x, y)$. It is 
easy to see that these modifications don't change the cost. 

It is not hard to see that $T'$ is also a valid compression tree with one less bad edge in either case. Therefore, 
repeating the above process generates a restricted solution with the same cost. 

The same problem was previously considered by Cristescu et al. [8], who also propose an approach 
that uses only second-order distributions (and makes the uniform entropy and conditional entropy 
assumption). They develop a $2(2+\sqrt{2})$-approximation for the problem. However, the solution space we consider 
in this paper is larger than the one they consider, in that it allows more freedom in choosing the compression 
tree⁴. 

We note that our approach to finding an optimal restricted solution is essentially the same as the one used by 
Rickenbach and Wattenhofer [27]. They also made use of the minimum weight (out-)arborescence algorithm 
to compute an optimal data collection scheme under some conditions. Actually, it can be shown that their 
solution space coincides with our restricted solution space, i.e., $X_i|X_j$ should be computed either at $i$ or $j$. 

Due to the significant resemblance to [27], we only briefly sketch our algorithm. Consider a compression 
tree $T$, and an edge $(u, v) \in T$ where $u$ is the parent of $v$ ($u$ and $v$ may not be adjacent in $\mathcal{G}_C$). By induction, 
we assume that the base station can restore the value of $X_u$ (using its parent). To compress $X_v$ using the 
value of $X_u$, we have two options: 

1. Node $u$ sends the value of $X_u$ ($= x_u$) to $v$; $v$ compresses $X_v$ using the conditional distribution $\Pr(X_v|X_u = 
x_u)$, and sends the result to the base station. The cost incurred is $H(X_u)\,d(u, v) + H(X_v|X_u)\,d(v, BS)$. 

2. Node $v$ sends $X_v$ to $u$; $u$ compresses $X_v$ given its value of $X_u$, and transmits the result to the base 
station $BS$. The cost incurred is $H(X_v)\,d(u, v) + H(X_v|X_u)\,d(u, BS)$. 

⁴ Another subtle difference is that they require all communication to be along only one routing tree; we do not 
require this of our solutions. 

We observe that the above choice has no impact on restoring information of any other node and thus it 
can be made independently for each pair of nodes. 

The discussion yields the following algorithm. Construct a weighted directed graph $G$ with the same 
vertex set as $\mathcal{G}_C$. For each pair of vertices in the communication graph $\mathcal{G}_C$, we add two directed edges. The 
cost of the directed edge $(u, v)$ is set to be: 

$$c(u, v) = \min\{ H(X_u)\,d(u, v) + H(X_v|X_u)\,d(v, BS),\; H(X_v)\,d(u, v) + H(X_v|X_u)\,d(u, BS) \}$$

Similarly we add an edge $(v, u)$. Essentially, $c(u, v)$ captures the minimal cost incurred in using $X_u$ to 
compress $X_v$ (assuming a restricted solution). Further, we add edges $(BS, v)$ from $BS$ to every node $v$ with 
cost $c(BS, v) = H(X_v)\,d(v, BS)$. 

We then compute a minimum weight (out-)arborescence $T$ (directed spanning tree) rooted at $BS$, which 
serves as our final compression tree [11]. The actual data transmission plan is easily constructed from the 
above discussion. 
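
The construction above can be sketched end to end as follows. This is a hedged illustration rather than the authors' implementation: the helper names are hypothetical, edges are added between all pairs of sensor nodes, and the arborescence weight is computed with a plain Chu-Liu/Edmonds contraction loop instead of the faster algorithm of [11]. Only the optimal cost is returned, not the tree itself.

```python
def restricted_edge_cost(H, Hc, d, bs, u, v):
    """c(u, v): cheapest way to compress X_v using X_u at either u or v.

    H[i] = H(X_i), Hc[(v, u)] = H(X_v | X_u), d = distance matrix.
    """
    at_child = H[u] * d[u][v] + Hc[(v, u)] * d[v][bs]    # u ships X_u to v
    at_parent = H[v] * d[u][v] + Hc[(v, u)] * d[u][bs]   # v ships X_v to u
    return min(at_child, at_parent)


def build_edges(H, Hc, d, bs):
    """Directed edges (u, v, cost) of the auxiliary graph."""
    n = len(d)
    edges = [(bs, v, H[v] * d[v][bs]) for v in range(n) if v != bs]
    for u in range(n):
        for v in range(n):
            if u != v and bs not in (u, v):
                edges.append((u, v, restricted_edge_cost(H, Hc, d, bs, u, v)))
    return edges


def min_arborescence_cost(n, root, edges):
    """Weight of a minimum out-arborescence rooted at `root`, or None."""
    total = 0.0
    edges = list(edges)
    while True:
        inf = float("inf")
        min_in, pre = [inf] * n, [-1] * n
        for u, v, w in edges:             # cheapest incoming edge per node
            if u != v and w < min_in[v]:
                min_in[v], pre[v] = w, u
        if any(v != root and min_in[v] == inf for v in range(n)):
            return None                   # some node is unreachable
        min_in[root] = 0.0
        comp, mark, cnt = [-1] * n, [-1] * n, 0
        for v in range(n):
            total += min_in[v]
            x = v
            while mark[x] != v and comp[x] == -1 and x != root:
                mark[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:   # found a new cycle through x
                y = pre[x]
                while y != x:
                    comp[y] = cnt
                    y = pre[y]
                comp[x] = cnt
                cnt += 1
        if cnt == 0:
            return total                  # no cycles: chosen edges form a tree
        for v in range(n):
            if comp[v] == -1:
                comp[v] = cnt
                cnt += 1
        # contract cycles and reduce weights of edges entering them
        edges = [(comp[u], comp[v], w - min_in[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n = comp[root], cnt
```

As a small usage example, take a base station 0 and two sensors on a line (0 - 1 - 2, unit spacing) with `H = [0.0, 1.0, 1.0]`, `Hc = {(1, 2): 0.2, (2, 1): 0.2}` and `d = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]`. Then `min_arborescence_cost(3, 0, build_edges(H, Hc, d, 0))` evaluates to about 2.2: sensor 1 reports independently and sensor 2 is compressed against it.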



5 Experimental Evaluation 

We conducted a comprehensive simulation study over several datasets comparing the performance of several 
approaches for data collection. Our results illustrate that our algorithms can exploit the spatial correlations 
in the data effectively, and perform comparably to the DSC lower bound. Due to space constraints, we 
present results only for the WL model (broadcast communication) over a few representative settings. 

Comparison systems: 

We compare the following data collection methods. 

- IND (Sec. 2.2): Each node compresses its data independently of the others. 

- Cluster (Sec. 2.2): The clusters are chosen using the greedy algorithm presented in Chu et al. [6]: 
we start with each node being in its own cluster, and combine clusters greedily, till no improvement is 
observed. 

- DSC: the theoretical lower bound is plotted (Sec. 2.2). 

- TreeStar: Our algorithm, presented in Sec. 3.4, augmented with a greedy local improvement step⁵. 

For the TreeStar algorithm, we also show the NC cost (which measures how well the compression tree 
chosen by TreeStar approximates the original distribution). This cost is lower bounded by the cost of DSC 
(which uses the best possible compression tree). 

⁵ After the TreeStar algorithm finds a feasible solution, adding a few redundant local broadcasts can cause significant reduction 
in the NC cost. We greedily add such local broadcasts till the solution stops improving. 

Figure 4: Results of the experimental evaluation over the Rainfall data 

Rainfall Data: 

For our first set of experiments, we use an analytical expression of the entropy that was derived by Pattem et 
al. [21] for a data set containing precipitation data collected in the states of Washington and Oregon during 
1949-1994 [29]. All the nodes have uniform entropy ($H(X_i) = h$), and the conditional entropies are given 
by: 

$$H(X_i|X_j) = \left( 1 - \frac{1}{dist(i,j)/c + 1} \right) h$$

where $dist(i,j)$ is the Euclidean distance between the sensors $i$ and $j$. The parameter $c$ controls the 
correlation. For small values of $c$, $H(X_i|X_j) \approx h$ (indicating independence), but as $c$ increases, the conditional 
entropy approaches 0. 
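
The effect of the parameter c can be checked numerically with a small sketch. It assumes the conditional entropy has the form (1 - 1/(dist(i,j)/c + 1)) h, which matches the limiting behaviour described in the text; the function name `conditional_entropy` and the coordinate arguments are hypothetical.

```python
import math


def conditional_entropy(h: float, c: float, pos_i, pos_j) -> float:
    """H(X_i | X_j) under the assumed distance-based correlation model.

    h     -- the uniform marginal entropy H(X_i)
    c     -- correlation parameter (larger c means stronger correlation)
    pos_* -- sensor coordinates, e.g. (x, y)
    """
    dist = math.dist(pos_i, pos_j)            # Euclidean distance
    return (1.0 - 1.0 / (dist / c + 1.0)) * h
```

For two sensors 10 units apart with h = 1, choosing c = 10 gives a conditional entropy of 0.5, a very small c gives a value close to 1 (near independence), and a very large c gives a value close to 0.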

Figure 4 shows the results for 3 synthetically generated sensor networks. We plot the total communication 
cost for each of the above approaches normalized by the cost of IND. The first plot shows the results 
for a 100-node network where the sensor nodes are arranged in a uniform grid. Since the conditional 
entropies depend only on the distance, for any two adjacent nodes $i, j$, $H(X_i|X_j)$ is constant. Because of this, 
TreeStar-NC is always equal to DSC in this case. As we can see, the extra cost (of local broadcasts) is quite 
small, and overall TreeStar performs much better than either Cluster or IND, and performs nearly as well as 
DSC. 

We then ran experiments on randomly generated sensor networks, each containing 100 nodes. The 
nodes were randomly placed in either a 200x200 square or a 300x30 rectangle, and communication links 
were added between nodes that were sufficiently close to each other (distance < 30). For each plotted data 
point, we ran the algorithms on 10 randomly chosen networks, and averaged the results. As we can see in 
Figure 4 (ii) and (iii), the relative performance of the algorithms is quite similar to the first experiment. 
Note that, because the conditional entropies are not uniform, the TreeStar-NC cost was typically somewhat 
higher than DSC. The cost of local broadcasts for TreeStar was again relatively low. 

Gaussian approximation to the Intel Lab Data: 

For our second set of experiments, we used multivariate Gaussian models learned over the temperature data 
collected at an indoor, 49-node deployment at the Intel Research Lab, Berkeley⁶. Separate models were 
learned for each hour of day [9] and we show results for 6 of those. After learning the Gaussian model, we 
use the differential entropy of these Gaussians for comparing the data collection costs. We use the aggregated 
connectivity data available with the dataset to simulate different connectivity behavior: in one case, we put 
communication links between nodes where the success probability was > .35, resulting in somewhat sparse 
network, whereas in the other case, we used a threshold of .20. 
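
As an illustration of how entropies can be read off a Gaussian model, the sketch below uses the standard closed forms for the differential entropy of a univariate Gaussian and for the conditional variance of a bivariate Gaussian. The function names are hypothetical, values are in nats, and this is not claimed to be the exact computation used in the experiments.

```python
import math


def gaussian_entropy(var: float) -> float:
    """Differential entropy (in nats) of a Gaussian with variance `var`."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def gaussian_conditional_entropy(var_i: float, var_j: float,
                                 cov_ij: float) -> float:
    """h(X_i | X_j) for jointly Gaussian X_i, X_j.

    Conditioning on X_j shrinks the variance of X_i to
    var_i - cov_ij**2 / var_j.
    """
    cond_var = var_i - cov_ij ** 2 / var_j
    return gaussian_entropy(cond_var)
```

With zero covariance the conditional entropy equals the marginal entropy, and it decreases as the magnitude of the covariance grows. Note that differential entropies can be negative, unlike entropies of discrete variables.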

Figure 5 shows the comparative results for this dataset. The dataset does not exhibit very strong spatial 
correlations: as we can see, optimal exploitation of the spatial correlations (using DSC) can only result in 
at best a factor of 4 or 5 improvement over IND (which ignores the correlations). However, TreeStar still 
performs very well compared to the lower bound on the data collection cost, and much better than the Cluster 
approach. Different connectivity behavior does not affect the relative performance of the algorithms much, 
with the low-connectivity network consistently incurring about twice as much total energy cost compared to 
the high-connectivity network. 

⁶ http://db.csail.mit.edu/labdata/labdata.html 

Figure 5: Results for the Gaussian dataset 

6 Related Work 



Wireless sensor networks have been a very active area of research in recent years (see [1] for a survey). 
Due to space constraints, we only discuss some of the most closely related work on data collection in 
sensor networks here. Directed diffusion [17], TinyDB [20], and LEACH [16] are some of the general purpose 
data collection mechanisms that have been proposed in the literature. The focus of that work has been 
on designing protocols and/or declarative interfaces to collect data, and not on optimizing continuous data 
collection. Aside from the works discussed earlier in the paper [27, 6, 8], the BBQ system [9] also uses 
a predictive modeling-based approach to collect data from a sensor network. However, the BBQ system 
only provides probabilistic, approximate answers to queries, without any guarantees on the correctness. 
Scaglione and Servetto [23] also consider the interdependence of routing and data compression, but the 
problem they focus on (getting all data to all nodes) is different from the problem we address. In seminal 
work, Gupta and Kumar [15] proved that the transport capacity of a random wireless network scales only 
as $O(\sqrt{n})$, where $n$ is the number of sensor nodes. Although this seriously limits the scalability of sensor 
networks in some domains, in the kinds of applications we are looking at, the bandwidth or the rate is rarely 
the limiting factor; to be able to last a long time, the sensor nodes are typically almost always in sleep mode. 

Several approaches not based on predictive modeling have also been proposed for data collection in 
sensor networks or distributed environments. For example, constraint chaining [24] is a suppression-based 
exact data collection approach that monitors a minimal set of node and edge constraints to ensure correct 
recovery of the values at the base station. 



7 Conclusions 

Designing practical data collection protocols that can optimally exploit the strong spatial correlations 
typically observed in a given sensor network remains an open problem. In this paper, we considered this problem 
with the restriction that the data collection protocol can only utilize second-order marginal or conditional 
distributions. We analyzed the problem, and drew strong connections to the previously studied 
weakly-connected dominating set problem. This enabled us to develop a greedy framework for approximating this 
problem under various communication models and solution space settings. Although we are not able 
to obtain constant factor approximations, our empirical study showed that our approach performs very well 
compared to the DSC lower bound. We observe that the worst case for the problem appears to be when the 
conditional entropies are close to zero, and that we can get better approximation bounds if we lower-bound 
the conditional entropies. Future research directions include generalizing our approach to consider 
higher-order marginal and conditional distributions, and improving the approximation bounds by incorporating 
lower bounds on the conditional entropy values. 



References 

[1] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. Wireless sensor networks: a survey. Computer 
Networks, 2002. 

[2] M. Charikar, C. Chekuri, T. Cheung, Z. Dai, A. Goel, and M. Li. Approximation algorithm for directed Steiner 
problem. Journal of Algorithms, 33(1):73-91, 1999. 

[3] C. Chekuri, G. Even, and G. Kortsarz. A greedy approximation algorithm for the group Steiner problem. Discrete 
Applied Mathematics, 154(1):15-34, 2006. 

[4] Y. Chen and A. L. Liestman. Approximating minimum size weakly-connected dominating sets for clustering 
mobile ad hoc networks. In Mobihoc, pages 165-172, 2002. 

[5] C.K. Chow and C.N. Liu. Approximating Discrete Probability Distributions with Dependence Trees. IEEE 
Transactions on Information Theory, (3):462-467, 1968. 

[6] D. Chu, A. Deshpande, J. Hellerstein, and W. Hong. Approximate data collection in sensor networks using 
probabilistic models. In International Conference on Data Engineering (ICDE), 2006. 

[7] R. Cristescu, B. Beferull-Lozano, and M. Vetterli. Networked Slepian-Wolf: Theory and algorithms. In EWSN, 
2004. 

[8] R. Cristescu, B. Beferull-Lozano, M. Vetterli, and R. Wattenhofer. Network correlated data gathering with 
explicit communication: NP-completeness and algorithms. IEEE/ACM Transactions on Networking, 14(1):41-54, 
2006. 

[9] A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor 
networks. In VLDB, 2004. 

[10] Uriel Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634-652, 1998. 

[11] H. N. Gabow, Z. Galil, T. Spencer, and R. E. Tarjan. Efficient algorithms for finding minimum spanning trees in 
undirected and directed graphs. Combinatorica, 6(2):109-122, 1986. 

[12] A. Goel and D. Estrin. Simultaneous optimization for concave costs: Single sink aggregation or single source 
buy-at-bulk. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2003. 

[13] S. Guha and S. Khuller. Approximation algorithms for connected dominating sets. Algorithmica, 20(4), 1998. 

[14] H. Gupta, V. Navda, S. Das, and V. Chowdhary. Efficient gathering of correlated data in sensor networks. In 
MobiHoc, 2005. 

[15] P. Gupta and P. R. Kumar. The capacity of wireless networks. IEEE Transactions on Information Theory, 46, 
2000. 

[16] W. R. Heinzelman, A. Chandrakasan, and H. Balakrishnan. Energy-efficient communication protocol for wireless 
microsensor networks. In HICSS, 2000. 

[17] C. Intanagonwiwat, R. Govindan, and D. Estrin. Directed diffusion: A scalable and robust communication 
paradigm for sensor networks. In ACM MobiCOM, 2000. 

[18] Y. Kotidis. Snapshot queries: Towards data-centric sensor networks. In ICDE, 2005. 

[19] J. Liu, M. Adler, D. Towsley, and C. Zhang. On optimal communication cost for gathering correlated data 
through wireless sensor networks. In Proceedings of ACM MobiCOM, 2006. 

[20] Samuel Madden, Wei Hong, Joseph M. Hellerstein, and Michael Franklin. TinyDB web page. 
http://telegraph.cs.berkeley.edu/tinydb. 






[21] S. Pattem, B. Krishnamachari, and R. Govindan. The impact of spatial correlation on routing with compression 
in wireless sensor networks. In IPSN, 2004. 

[22] S. Pradhan and K. Ramchandran. Distributed source coding using syndromes (DISCUS): Design and construc- 
tion. IEEE Trans. Information Theory, 2003. 

[23] A. Scaglione and S. Servetto. On the interdependence of routing and data compression in multi-hop sensor 
networks. In Mobicom, 2002. 

[24] A. Silberstein, R. Braynard, and J. Yang. Constraint-chaining: On energy-efficient continuous monitoring in 
sensor networks. In SIGMOD, 2006. 

[25] D. Slepian and J Wolf. Noiseless coding of correlated information sources. IEEE Transactions on Information 
Theory, 19(4), 1973. 

[26] Xun Su. A combinatorial algorithmic approach to energy efficient information collection in wireless sensor 
networks. ACM Trans. Sen. Netw., 3(1):6, 2007. 

[27] Pascal von Rickenbach and Roger Wattenhofer. Gathering correlated data in sensor networks. In Proc. of the 
ACM Joint Workshop on Foundations of Mobile Computing (DIALM-POMC), pages 60-66, 2004. 

[28] L. Wang and A. Deshpande. Predictive modeling-based data collection in wireless sensor networks. In EWSN, 
2008. 

[29] M. Widmann and C. Bretherton. 50 km resolution daily precipitation for the pacific northwest, 2003. 
http://www.jisao.washington.edu/datajets/widmann. 

[30] A. D. Wyner and J. Ziv. The rate-distortion function for source coding with side information at the decoder. 
IEEE Transactions on Information Theory, 1976. 

[31] Z. Xiong, A. D. Liveris, and S. Cheng. Distributed source coding for sensor networks. IEEE Signal Processing 
Magazine, 21, 2004. 






